Data Loading and Quality Check¶
Information about some column values:
- Open: an indicator for whether the store was open: 0 = closed, 1 = open
- StateHoliday: indicates a state holiday. Normally all stores, with few exceptions, are closed on state holidays. Note that all schools are closed on public holidays and weekends. a = public holiday, b = Easter holiday, c = Christmas, 0 = none
- StoreType: differentiates between 4 different store models: a, b, c, d
- Promo2: a continuing and consecutive promotion for some stores: 0 = store is not participating, 1 = store is participating
- Assortment: describes an assortment level: a = basic, b = extra, c = extended
import pandas as pd
from utils.loaders import load_config
import plotly.express as px
import plotly.graph_objects as go
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import plotly.io as pio
pio.renderers.default = "notebook"
conf = load_config("../config/config.yaml")
Opening the train file
df = pd.read_csv(conf["data"]["train"], parse_dates=["Date"])
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1017209 entries, 0 to 1017208
Data columns (total 9 columns):
 #   Column         Non-Null Count    Dtype
---  ------         --------------    -----
 0   Store          1017209 non-null  int64
 1   DayOfWeek      1017209 non-null  int64
 2   Date           1017209 non-null  datetime64[ns]
 3   Sales          1017209 non-null  int64
 4   Customers      1017209 non-null  int64
 5   Open           1017209 non-null  int64
 6   Promo          1017209 non-null  int64
 7   StateHoliday   1017209 non-null  object
 8   SchoolHoliday  1017209 non-null  int64
dtypes: datetime64[ns](1), int64(7), object(1)
memory usage: 69.8+ MB
DtypeWarning: Columns (7) have mixed types. Specify dtype option on import or set low_memory=False.
- We have mixed types in some columns.
for col in df.columns:
    inferred_type = pd.api.types.infer_dtype(df[col])
    print(f"Column '{col}': Inferred type is '{inferred_type}'")
Column 'Store': Inferred type is 'integer'
Column 'DayOfWeek': Inferred type is 'integer'
Column 'Date': Inferred type is 'datetime64'
Column 'Sales': Inferred type is 'integer'
Column 'Customers': Inferred type is 'integer'
Column 'Open': Inferred type is 'integer'
Column 'Promo': Inferred type is 'integer'
Column 'StateHoliday': Inferred type is 'mixed-integer'
Column 'SchoolHoliday': Inferred type is 'integer'
df["StateHoliday"] = df["StateHoliday"].astype(str)
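Alternatively, the mixed type can be prevented at load time by passing `dtype=` to `read_csv`, so pandas never has to infer it. A minimal sketch on a hypothetical toy sample (the rows below are made up, not the real file):

```python
import pandas as pd
from io import StringIO

# Hypothetical mini sample mimicking the train schema: StateHoliday mixes
# the integer 0 with letters a/b/c, which is what triggers the DtypeWarning.
sample = StringIO(
    "Store,Date,Sales,StateHoliday\n"
    "1,2015-07-31,5263,0\n"
    "1,2015-12-25,0,c\n"
)

# dtype= forces the column to str during parsing, so no mixed types arise.
df_small = pd.read_csv(sample, parse_dates=["Date"], dtype={"StateHoliday": str})
print(df_small["StateHoliday"].tolist())  # ['0', 'c']
```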
store_df = pd.read_csv(conf["data"]["store"])
store_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1115 entries, 0 to 1114
Data columns (total 10 columns):
 #   Column                     Non-Null Count  Dtype
---  ------                     --------------  -----
 0   Store                      1115 non-null   int64
 1   StoreType                  1115 non-null   object
 2   Assortment                 1115 non-null   object
 3   CompetitionDistance        1112 non-null   float64
 4   CompetitionOpenSinceMonth  761 non-null    float64
 5   CompetitionOpenSinceYear   761 non-null    float64
 6   Promo2                     1115 non-null   int64
 7   Promo2SinceWeek            571 non-null    float64
 8   Promo2SinceYear            571 non-null    float64
 9   PromoInterval              571 non-null    object
dtypes: float64(5), int64(2), object(3)
memory usage: 87.2+ KB
merged_df = pd.merge(df, store_df, on="Store", how="inner")
merged_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1017209 entries, 0 to 1017208
Data columns (total 18 columns):
 #   Column                     Non-Null Count    Dtype
---  ------                     --------------    -----
 0   Store                      1017209 non-null  int64
 1   DayOfWeek                  1017209 non-null  int64
 2   Date                       1017209 non-null  datetime64[ns]
 3   Sales                      1017209 non-null  int64
 4   Customers                  1017209 non-null  int64
 5   Open                       1017209 non-null  int64
 6   Promo                      1017209 non-null  int64
 7   StateHoliday               1017209 non-null  object
 8   SchoolHoliday              1017209 non-null  int64
 9   StoreType                  1017209 non-null  object
 10  Assortment                 1017209 non-null  object
 11  CompetitionDistance        1014567 non-null  float64
 12  CompetitionOpenSinceMonth  693861 non-null   float64
 13  CompetitionOpenSinceYear   693861 non-null   float64
 14  Promo2                     1017209 non-null  int64
 15  Promo2SinceWeek            509178 non-null   float64
 16  Promo2SinceYear            509178 non-null   float64
 17  PromoInterval              509178 non-null   object
dtypes: datetime64[ns](1), float64(5), int64(8), object(4)
memory usage: 139.7+ MB
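Since each row of `store_df` should describe exactly one store, the merge can also be guarded with `validate=`, which raises if the key relationship is violated. A sketch on hypothetical toy frames:

```python
import pandas as pd

# Hypothetical toy frames: daily rows on the left, one metadata row per store
# on the right (values made up for illustration).
daily = pd.DataFrame({"Store": [1, 1, 2], "Sales": [5263, 6064, 8314]})
meta = pd.DataFrame({"Store": [1, 2], "StoreType": ["c", "a"]})

# validate="many_to_one" asserts the right side has unique Store keys; a
# duplicated metadata row would raise pandas.errors.MergeError instead of
# silently duplicating the daily rows.
merged = pd.merge(daily, meta, on="Store", how="inner", validate="many_to_one")
print(merged.shape)  # (3, 3)
```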
train_df = merged_df[merged_df["Open"] == 1].copy()  # .copy() avoids SettingWithCopyWarning when adding columns later
train_df.head(20)
| | Store | DayOfWeek | Date | Sales | Customers | Open | Promo | StateHoliday | SchoolHoliday | StoreType | Assortment | CompetitionDistance | CompetitionOpenSinceMonth | CompetitionOpenSinceYear | Promo2 | Promo2SinceWeek | Promo2SinceYear | PromoInterval |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 5 | 2015-07-31 | 5263 | 555 | 1 | 1 | 0 | 1 | c | a | 1270.0 | 9.0 | 2008.0 | 0 | NaN | NaN | NaN |
| 1 | 2 | 5 | 2015-07-31 | 6064 | 625 | 1 | 1 | 0 | 1 | a | a | 570.0 | 11.0 | 2007.0 | 1 | 13.0 | 2010.0 | Jan,Apr,Jul,Oct |
| 2 | 3 | 5 | 2015-07-31 | 8314 | 821 | 1 | 1 | 0 | 1 | a | a | 14130.0 | 12.0 | 2006.0 | 1 | 14.0 | 2011.0 | Jan,Apr,Jul,Oct |
| 3 | 4 | 5 | 2015-07-31 | 13995 | 1498 | 1 | 1 | 0 | 1 | c | c | 620.0 | 9.0 | 2009.0 | 0 | NaN | NaN | NaN |
| 4 | 5 | 5 | 2015-07-31 | 4822 | 559 | 1 | 1 | 0 | 1 | a | a | 29910.0 | 4.0 | 2015.0 | 0 | NaN | NaN | NaN |
| 5 | 6 | 5 | 2015-07-31 | 5651 | 589 | 1 | 1 | 0 | 1 | a | a | 310.0 | 12.0 | 2013.0 | 0 | NaN | NaN | NaN |
| 6 | 7 | 5 | 2015-07-31 | 15344 | 1414 | 1 | 1 | 0 | 1 | a | c | 24000.0 | 4.0 | 2013.0 | 0 | NaN | NaN | NaN |
| 7 | 8 | 5 | 2015-07-31 | 8492 | 833 | 1 | 1 | 0 | 1 | a | a | 7520.0 | 10.0 | 2014.0 | 0 | NaN | NaN | NaN |
| 8 | 9 | 5 | 2015-07-31 | 8565 | 687 | 1 | 1 | 0 | 1 | a | c | 2030.0 | 8.0 | 2000.0 | 0 | NaN | NaN | NaN |
| 9 | 10 | 5 | 2015-07-31 | 7185 | 681 | 1 | 1 | 0 | 1 | a | a | 3160.0 | 9.0 | 2009.0 | 0 | NaN | NaN | NaN |
| 10 | 11 | 5 | 2015-07-31 | 10457 | 1236 | 1 | 1 | 0 | 1 | a | c | 960.0 | 11.0 | 2011.0 | 1 | 1.0 | 2012.0 | Jan,Apr,Jul,Oct |
| 11 | 12 | 5 | 2015-07-31 | 8959 | 962 | 1 | 1 | 0 | 1 | a | c | 1070.0 | NaN | NaN | 1 | 13.0 | 2010.0 | Jan,Apr,Jul,Oct |
| 12 | 13 | 5 | 2015-07-31 | 8821 | 568 | 1 | 1 | 0 | 0 | d | a | 310.0 | NaN | NaN | 1 | 45.0 | 2009.0 | Feb,May,Aug,Nov |
| 13 | 14 | 5 | 2015-07-31 | 6544 | 710 | 1 | 1 | 0 | 1 | a | a | 1300.0 | 3.0 | 2014.0 | 1 | 40.0 | 2011.0 | Jan,Apr,Jul,Oct |
| 14 | 15 | 5 | 2015-07-31 | 9191 | 766 | 1 | 1 | 0 | 1 | d | c | 4110.0 | 3.0 | 2010.0 | 1 | 14.0 | 2011.0 | Jan,Apr,Jul,Oct |
| 15 | 16 | 5 | 2015-07-31 | 10231 | 979 | 1 | 1 | 0 | 1 | a | c | 3270.0 | NaN | NaN | 0 | NaN | NaN | NaN |
| 16 | 17 | 5 | 2015-07-31 | 8430 | 946 | 1 | 1 | 0 | 1 | a | a | 50.0 | 12.0 | 2005.0 | 1 | 26.0 | 2010.0 | Jan,Apr,Jul,Oct |
| 17 | 18 | 5 | 2015-07-31 | 10071 | 936 | 1 | 1 | 0 | 1 | d | c | 13840.0 | 6.0 | 2010.0 | 1 | 14.0 | 2012.0 | Jan,Apr,Jul,Oct |
| 18 | 19 | 5 | 2015-07-31 | 8234 | 718 | 1 | 1 | 0 | 1 | a | c | 3240.0 | NaN | NaN | 1 | 22.0 | 2011.0 | Mar,Jun,Sept,Dec |
| 19 | 20 | 5 | 2015-07-31 | 9593 | 974 | 1 | 1 | 0 | 0 | d | a | 2340.0 | 5.0 | 2009.0 | 1 | 40.0 | 2014.0 | Jan,Apr,Jul,Oct |
missing_values = train_df.isnull().sum()
percentage_missing = (missing_values / len(train_df)) * 100
percentage_missing
Store                         0.000000
DayOfWeek                     0.000000
Date                          0.000000
Sales                         0.000000
Customers                     0.000000
Open                          0.000000
Promo                         0.000000
StateHoliday                  0.000000
SchoolHoliday                 0.000000
StoreType                     0.000000
Assortment                    0.000000
CompetitionDistance           0.258884
CompetitionOpenSinceMonth    31.812120
CompetitionOpenSinceYear     31.812120
Promo2                        0.000000
Promo2SinceWeek              50.131574
Promo2SinceYear              50.131574
PromoInterval                50.131574
dtype: float64
px.box(train_df,x="Sales")
Conclusion:
- Promo2SinceWeek, Promo2SinceYear, and PromoInterval have ~50% missing values; analyze their importance to decide whether to exclude or impute them.
- All columns have correct types.
- We have some outliers, with the extreme ones being > 40k, but these are likely caused by promotions and need to be checked.
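One candidate treatment for the ~50%-missing Promo2 columns: if the fields are empty precisely because the store does not participate in Promo2, filling with 0 is a structural fix rather than a statistical imputation. A sketch, assuming that interpretation holds:

```python
import numpy as np
import pandas as pd

# Hypothetical rows: Promo2SinceWeek is NaN exactly when the store does not
# participate in Promo2 (Promo2 == 0).
toy = pd.DataFrame({
    "Promo2": [0, 1, 1],
    "Promo2SinceWeek": [np.nan, 13.0, 14.0],
})

# Non-participating rows get 0; participating rows keep their start week.
toy["Promo2SinceWeek"] = toy["Promo2SinceWeek"].fillna(0)
print(toy["Promo2SinceWeek"].tolist())  # [0.0, 13.0, 14.0]
```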
Data Analysis for Target Variable¶
train_df["Month"] = train_df["Date"].dt.month
train_df["Year"] = train_df["Date"].dt.year
train_df.groupby(["StoreType"]).Store.count()
StoreType
a    457077
b     15563
c    112978
d    258774
Name: Store, dtype: int64
- Stores of type b are far fewer than the other store types.
train_df.groupby(["StoreType","Assortment"]).Store.count()
StoreType Assortment
a a 286053
c 171024
b a 6409
b 8212
c 942
c a 58561
c 54417
d a 93886
c 164888
Name: Store, dtype: int64
- Stores of type b are the only ones with assortment level b (extra).
px.histogram(train_df,x="Sales")
- Daily sales are concentrated in the 3k–8k range, but the distribution is right-skewed, with extreme peaks above 20k and up to ~40k, likely driven by promotions.
avg_sales_day_df = train_df.groupby(["Date"]).Sales.mean().reset_index()
px.line(avg_sales_day_df, x="Date", y="Sales")
- Sales follow a predictable weekly cycle, with weekends (and especially December) showing peaks. A sharp drop occurs each January, consistent with post-holiday demand normalization. Promotional spikes occur throughout the year, amplifying variability
px.box(train_df,x="Sales")
- Most daily sales fall between ~4.5k and 8k (IQR), with a median near 6k. However, extreme outliers are present, including one day >40k, likely linked to exceptional promotional activity.
px.box(train_df,x="Sales", color="DayOfWeek")
- Sales show a clear weekly pattern: weekdays are stable but lower, while Sundays and Mondays drive significantly higher sales, albeit with more variability. This suggests staffing and inventory planning should be adjusted to meet peak demand at the start and end of the week.
avg_month_sales = train_df.groupby(["Year","Month"]).Sales.mean().reset_index()
px.line(avg_month_sales,x="Month",y="Sales", color="Year")
- We can notice a clear cycle: sales follow a strong annual pattern, with demand peaking in December and dropping in January, while mid-year sales remain low and stable. This implies seasonality.
avg_sales_Promo_df = train_df.groupby(["Promo"]).Sales.mean().reset_index()
px.bar(avg_sales_Promo_df,y="Sales",x="Promo",text_auto=".2f")
promo_uplift = ((avg_sales_Promo_df.loc[1, "Sales"] - avg_sales_Promo_df.loc[0, "Sales"]) / avg_sales_Promo_df.loc[0, "Sales"]) * 100
promo_uplift
38.770714
- Promotions are highly effective, lifting sales by ~38% on average compared to normal days.
promo_sales_store = train_df.groupby(["StoreType","Promo"])["Sales"].mean().unstack()
promo_sales_store["uplift_pct"] = (promo_sales_store[1] - promo_sales_store[0]) / promo_sales_store[0] * 100
promo_sales_store
| StoreType | Promo = 0 | Promo = 1 | uplift_pct |
|---|---|---|---|
| a | 5808.500091 | 8304.102317 | 42.964658 |
| b | 9566.861686 | 11307.994108 | 18.199619 |
| c | 6028.063255 | 8041.995526 | 33.409276 |
| d | 5855.268289 | 8017.828656 | 36.933583 |
- Promotions are most effective in StoreType a, while StoreType b already drives high sales with only modest promo uplift. This suggests tailoring promo spend: invest more in a/c/d, and rethink promo strategy in b where uplift is limited
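The groupby/unstack uplift pattern above generalizes to any grouping column; a small hypothetical helper (`promo_uplift_by` is not part of the notebook, just an illustration on made-up data):

```python
import pandas as pd

def promo_uplift_by(df: pd.DataFrame, key: str) -> pd.DataFrame:
    """Mean sales with and without promo per group, plus percentage uplift."""
    table = df.groupby([key, "Promo"])["Sales"].mean().unstack()
    table["uplift_pct"] = (table[1] - table[0]) / table[0] * 100
    return table

# Hypothetical toy data: group "a" doubles under promo, group "b" gains 50%.
toy = pd.DataFrame({
    "StoreType": ["a", "a", "b", "b"],
    "Promo": [0, 1, 0, 1],
    "Sales": [100, 200, 100, 150],
})
print(promo_uplift_by(toy, "StoreType")["uplift_pct"].tolist())  # [100.0, 50.0]
```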
promo_sales_month = train_df.groupby(["Month","Promo"])["Sales"].mean().unstack()
promo_sales_month["uplift_pct"] = (promo_sales_month[1] - promo_sales_month[0]) / promo_sales_month[0] * 100
promo_sales_month
| Month | Promo = 0 | Promo = 1 | uplift_pct |
|---|---|---|---|
| 1 | 5591.904236 | 7782.800277 | 39.179785 |
| 2 | 5747.812834 | 7772.909263 | 35.232470 |
| 3 | 5805.723545 | 8252.983345 | 42.152538 |
| 4 | 5855.305439 | 8426.029528 | 43.904184 |
| 5 | 6300.749699 | 8087.946860 | 28.364833 |
| 6 | 5871.747172 | 8508.684577 | 44.908906 |
| 7 | 5751.911514 | 8257.588809 | 43.562515 |
| 8 | 5680.220869 | 7926.496229 | 39.545564 |
| 9 | 5528.634084 | 7916.399032 | 43.189057 |
| 10 | 5845.745508 | 7646.399697 | 30.802815 |
| 11 | 6257.749173 | 8163.897052 | 30.460599 |
| 12 | 7318.434182 | 10490.790244 | 43.347470 |
avg_sales_day_Promo_df = train_df.groupby(["Date","Promo"]).Sales.mean().reset_index()
px.line(avg_sales_day_Promo_df,x="Date",y="Sales", color="Promo")
- Promos are consistently effective year-round, but their relative impact is strongest in spring/summer (when baseline demand is lower) and in December (amplifying the holiday peak). October–November promos show diminishing returns
avg_sales_promo2_df = train_df.groupby(["Promo2"]).Sales.mean().reset_index()
px.bar(avg_sales_promo2_df,y="Sales",x="Promo2",text_auto=".2f")
avg_sales_school_df = train_df.groupby(["SchoolHoliday"]).Sales.mean().reset_index()
px.bar(avg_sales_school_df,y="Sales",x="SchoolHoliday",text_auto=".2f")
- School Holiday has little to no impact on sales
avg_sales_state_df = train_df.groupby(["StateHoliday"]).Sales.mean().reset_index()
px.bar(avg_sales_state_df,y="Sales",x="StateHoliday",text_auto=".2f")
baseline = avg_sales_state_df.loc[avg_sales_state_df["StateHoliday"] == "0", "Sales"].iloc[0]  # scalar, not a Series
avg_sales_state_df.Sales.apply(lambda x: (x - baseline) / baseline * 100)
0     0.000000
1    22.060154
2    42.199875
3    40.126920
Name: Sales, dtype: float64
- Public holidays increase sales by ~22% compared to normal days. Easter and Christmas are the biggest demand drivers (+40–42%).
avg_sales_store_df = train_df.groupby(["StoreType"]).Sales.mean().reset_index()
px.bar(avg_sales_store_df,y="Sales",x="StoreType",text_auto=".2f")
- Store type b has the highest average sales compared to the others. Why?
# CompetitionDistance ranges from very small (tens of meters) up to very large values; a log scale is needed to realistically capture the variation
train_df["CompDist_log"] = np.log1p(train_df["CompetitionDistance"])
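A quick sanity check of why `log1p` is the right transform here: it compresses the heavy right tail, maps 0 to 0 (unlike `log`, which would give -inf), and propagates NaN, so missing CompetitionDistance values stay missing. The sample values below are hypothetical:

```python
import numpy as np

# Hypothetical distances: zero, small, large, missing.
dist = np.array([0.0, 50.0, 29910.0, np.nan])
log_dist = np.log1p(dist)

print(log_dist[0])            # 0.0 (log1p(0) is exactly 0)
print(np.isnan(log_dist[3]))  # True (NaN propagates instead of erroring)
```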
distance_from_competition = train_df.groupby(["StoreType"]).CompDist_log.sum().reset_index()
px.bar(distance_from_competition, y="CompDist_log", x="StoreType", text_auto=".2f")
- StoreType b stores appear to be located in high-traffic retail clusters, with more customers and competitors nearby.
assor_store = train_df.groupby(["StoreType","Assortment"]).Sales.mean().reset_index()
px.bar(assor_store,x="StoreType",y="Sales", color="Assortment")
- StoreType b consistently outperforms others in daily sales, driven by broader assortments (extended/extra). Other store types (a/c/d) perform similarly, with assortment mix making only modest differences.
temp = train_df.groupby(["StoreType"]).Sales.sum().reset_index()
px.bar(temp,x="StoreType",y="Sales")
- While type b stores have higher average sales, they contribute little to overall sales since there are few of them.
avg_sales_dist_df = train_df.groupby(["CompDist_log"]).Sales.mean().reset_index()
px.scatter(avg_sales_dist_df, x="CompDist_log", y="Sales")
- CompetitionDistance has little predictive power on its own stores can succeed both near and far from competitors. This suggests that store type, assortment, and promotions are much stronger drivers of sales than competitor proximity
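The "little predictive power" claim can be quantified with a Pearson correlation between log distance and sales; a value near zero supports it. A sketch on hypothetical toy values:

```python
import numpy as np
import pandas as pd

# Hypothetical averages: sales barely move as log distance grows.
toy = pd.DataFrame({
    "CompDist_log": np.log1p([50.0, 500.0, 5000.0, 50000.0]),
    "Sales": [6000.0, 5800.0, 6100.0, 5900.0],
})

# Pearson r close to 0 indicates a weak linear relationship.
r = toy["CompDist_log"].corr(toy["Sales"])
print(round(r, 3))
```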
avg_sales_assor_df = train_df.groupby(["Assortment"]).Sales.mean().reset_index()
px.bar(avg_sales_assor_df,y="Sales",x="Assortment",text_auto=".2f")
- Assortment b generates higher average sales than the others.
px.scatter(train_df,x="Customers",y="Sales", color="StoreType", opacity=0.5)
- StoreType d achieves higher sales per customer, while Type b relies on higher traffic but with lower spend per visit.
# Computing the ratio Series first avoids the deprecated apply-on-grouping-columns pattern (FutureWarning)
avg_basket_size = (train_df["Sales"] / train_df["Customers"]).groupby(train_df["StoreType"]).mean()
print(avg_basket_size)
StoreType
a     8.846277
b     5.133097
c     8.626227
d    11.277862
dtype: float64